In this exercise, we work with crime data obtained from the Israeli police.
Using various machine learning algorithms, we analyze the data to find patterns and predict future crime rates.
Note: this project uses plotly, so make sure it is installed first.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
# question 3
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# question 4
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score
# question 5
from sklearn.ensemble import AdaBoostRegressor
from matplotlib.pyplot import *
# Set random seed to ensure reproducible runs
RSEED = 100
The data consist of three files.
crimes = pd.read_excel('crimes.xlsx')
crimes
Two column names are changed for convenience.
names_dict = {'יישוב מחושב': 'שם יישוב', 'תאור קבוצה סטטיסטית': 'עבירה'}
crimes.rename(columns=names_dict, inplace=True)
crimes.head()
The column שנת הודעה includes only one unique value and thus will be dropped.
len(crimes['שנת הודעה'].unique())
crimes.drop(['שנת הודעה'], axis=1, inplace=True)
In addition, all rows and columns that are sums of others will be dropped because of inaccurate values.
crimes.drop(['Total'], axis=1, inplace=True)
crimes = crimes.loc[crimes['שם יישוב'] != 'Total']
crimes = crimes.loc[crimes['עבירה'] != 'Total']
crimes.head()
Next, all missing values that represent the number of crimes (i.e. in the columns 2014-2019, where they are marked with '-') are replaced with 0.
years = [2014, 2015, 2016, 2017, 2018, 2019]
crimes[years] = crimes[years].replace('-', 0)
crimes.head()
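Note that after replacing '-', the year columns may still hold the object dtype. A toy sketch (not the actual crimes file) of the replacement followed by an explicit numeric cast:

```python
import pandas as pd

# toy frame mimicking the crimes data: '-' marks a missing count
toy = pd.DataFrame({'crime': ['a', 'b'], 2014: [3, '-'], 2015: ['-', 7]})
toy[[2014, 2015]] = toy[[2014, 2015]].replace('-', 0)
# after the replacement, the year columns can be cast to a numeric dtype
toy[[2014, 2015]] = toy[[2014, 2015]].astype(int)
```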
In some cities, there are crimes of category לא ידוע or a missing value (-). These will all be merged into the category שאר העבירות as they are not specific to a known category.
names = crimes['שם יישוב'].unique()
for name in names:
# get rows of specific city
x = crimes.loc[crimes['שם יישוב'] == name]
# get values of crime category '-'
x1 = x.loc[(x['עבירה'] == '-'), years]
# get values of crime category 'rest'
x2 = x.loc[x['עבירה'] == 'שאר העבירות', years]
# get values of crime category 'unknown'
x3 = x.loc[x['עבירה'] == 'לא ידוע', years]
# sum of values
s = x1.values.astype(int) + x2.values.astype(int) + x3.values.astype(int)
# append sum to df
crimes.loc[x2.index.values, years] = s
# drop values of crime category '-' and unknown
crimes = crimes.loc[(crimes['עבירה'] != '-') & (crimes['עבירה'] != 'לא ידוע')]
crimes.head(12)
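As a hedged alternative to the per-city loop above (which assumes every city has rows for all three categories), the same merge can be sketched on toy data by relabeling the categories and summing with groupby:

```python
import pandas as pd

# toy frame with hypothetical category labels standing in for '-',
# 'unknown' and 'rest'
toy = pd.DataFrame({'city': ['A', 'A', 'A'],
                    'crime': ['-', 'unknown', 'rest'],
                    2014: [1, 2, 3]})
# fold the unspecific categories into 'rest', then sum per (city, crime)
toy['crime'] = toy['crime'].replace({'-': 'rest', 'unknown': 'rest'})
merged = toy.groupby(['city', 'crime'], as_index=False).sum()
```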
The crime rates throughout the years are highly correlated. Therefore, we will use the average of years 2014-2018 for our train dataset, and the year 2019 for our test dataset. We've decided to use the average because it's a sufficient statistic and an unbiased estimate for the number of crimes.
crimes[years].corr()
# add average column
cols = crimes.loc[:, years[0:5]]
crimes['ממוצע'] = cols.mean(axis=1)
crimes.head(12)
# create train dataset
crimes_train = crimes[['סמל יישוב', 'שם יישוב', 'עבירה', 'ממוצע']]
crimes_train.head()
# create test dataset
crimes_test = crimes[['סמל יישוב', 'שם יישוב', 'עבירה', 2019]]
# rename column 2019
crimes_test = crimes_test.rename({2019: 'מספר עבירות'}, axis='columns')
crimes_test.head()
For easier implementation, we reshape the dataset to a wide format where each row represents one city and the columns are the different types of crimes.
crimes_train_wide = crimes_train.pivot_table(index=['שם יישוב', 'סמל יישוב'], columns=['עבירה'], values=['ממוצע'])
crimes_train_wide.head()
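A toy illustration of the long-to-wide pivot: each (city, crime) pair becomes one cell, and pairs absent from the long frame come out as NaN, which is why they are filled with 0 afterwards:

```python
import pandas as pd

# toy long-format frame: one row per (city, crime type), hypothetical names
long_df = pd.DataFrame({
    'city': ['A', 'A', 'B'],
    'crime': ['theft', 'fraud', 'theft'],
    'avg': [5.0, 2.0, 7.0],
})
wide = long_df.pivot_table(index='city', columns='crime', values='avg')
# city B has no fraud rows, so that cell is NaN and gets filled with 0
wide = wide.fillna(0)
```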
# replace missing values
crimes_train_wide = crimes_train_wide.replace(np.nan, 0)
# set columns and index
crimes_train_wide.columns = crimes_train_wide.columns.droplevel()
crimes_train_wide.reset_index(inplace=True)
crimes_train_wide.head()
# reshape test dataset
crimes_test_wide = crimes_test.pivot_table(index=['שם יישוב', 'סמל יישוב'], columns=['עבירה'],
values=['מספר עבירות'], aggfunc='first')
# replace nan values with 0
crimes_test_wide = crimes_test_wide.replace(np.nan, 0)
# set columns and index
crimes_test_wide.columns = crimes_test_wide.columns.droplevel()
crimes_test_wide.reset_index(inplace=True)
cities = pd.read_excel('cities.xlsx')
cities.head()
The column שנה includes only one unique value and thus will be dropped.
len(cities['שנה'].unique())
cities.drop(['שנה'], axis=1, inplace=True)
In addition, the columns שם יישוב באנגלית and תעתיק are identical and irrelevant. Therefore, they will be dropped.
cities.drop(['שם יישוב באנגלית', 'תעתיק'], axis=1, inplace=True)
temp = cities.drop(['שם יישוב'], axis=1)
temp_val=temp.corr().columns.values
temp_val[7] = 'סך הכל אוכלוסייה 8102'  # pre-reverse the year so '2018' still reads correctly after the RTL string reversal below
plt.figure(figsize=(15, 12))
sns.heatmap(temp.corr(), cmap='Blues',xticklabels=[i[::-1] for i in temp_val],
yticklabels=[i[::-1] for i in temp_val])
plt.title('Heatmap of Cities Dataset')
plt.show()
We see that the following columns are highly correlated (some are sums of the others), and thus two of them will be dropped.
cols = cities[['סך הכל אוכלוסייה 2018', 'יהודים ואחרים', 'מזה: יהודים', 'ערבים']]
cols.corr()
cities.drop(['יהודים ואחרים', 'מזה: יהודים'], axis=1, inplace=True)
In addition, the following columns are also highly correlated, and thus we will only keep the נפה column. Some of the others have missing values and the מחוז column is not as specific as נפה.
cols = cities[['מחוז', 'נפה', 'אזור טבעי', 'ועדת תכנון', 'אשכול רשויות מקומיות']]
cols.corr()
cities.drop(['מחוז', 'אזור טבעי', 'ועדת תכנון', 'אשכול רשויות מקומיות'], axis=1, inplace=True)
cities.isnull().sum()
For the column השתייכות ארגונית, we assume that the missing values represent 'ללא השתייכות'.
cities['השתייכות ארגונית'] = cities['השתייכות ארגונית'].fillna(19)
A similar approach is used for the column מעמד מונציפאלי where the missing values are assumed to be 'חסר מעמד'.
cities['מעמד מונציפאלי'] = cities['מעמד מונציפאלי'].fillna('חסר מעמד')
For the columns דת יישוב and מרחב משטרה, we added a category of 'אחר' for the missing values.
cities['מרחב משטרה'] = cities['מרחב משטרה'].fillna('אחר')
cities['דת יישוב'] = cities['דת יישוב'].fillna('אחר')
For the column ערבים, if the religion of the city is either 1 (Jewish) or 3 (Bedouin), then the number of Arabs is set to 0.
col = 'ערבים'
cities.loc[cities['דת יישוב'] == 1, col] = cities.loc[cities['דת יישוב'] == 1, col].replace(np.nan, 0)
cities.loc[cities['דת יישוב'] == 3, col] = cities.loc[cities['דת יישוב'] == 3, col].replace(np.nan, 0)
In the column שנת ייסוד there are a few values of 'ותיק' which will be replaced with NaN and later imputed.
cities = cities.replace('ותיק', np.nan)
Next, we decided to drop any rows that have missing values in more than half the columns, and any columns that have missing values in more than half the rows.
cities.shape
cities.dropna(thresh=8, axis=0, inplace=True)
cities.dropna(thresh=800, axis=1, inplace=True)
cities.shape
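The `thresh` argument of `dropna` counts the non-missing values a row (or column) needs in order to be kept, which is how the "more than half missing" rule above is expressed. A toy example:

```python
import numpy as np
import pandas as pd

# thresh=2: a row survives only if it has at least 2 non-NaN values
toy = pd.DataFrame({'a': [1, np.nan], 'b': [2, np.nan], 'c': [3, 4]})
kept = toy.dropna(thresh=2, axis=0)  # second row has only one value, dropped
```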
All the remaining missing values of numeric columns, are imputed using KNN.
imputer = KNNImputer(n_neighbors=5, weights="uniform")
cols = ['סך הכל אוכלוסייה 2018', 'שנת ייסוד', 'קואורדינטות', 'גובה', 'ערבים']
imputed = imputer.fit_transform(cities[cols])
cities[cols] = imputed
cities.isnull().sum()
cities.dtypes
The data types are changed to match the most suitable type.
cities = cities.astype('category')
cities = cities.astype({'סמל יישוב': 'str', 'סך הכל אוכלוסייה 2018': 'float64', 'ערבים': 'float64',
'שנת ייסוד': 'int64', 'קואורדינטות': 'float64', 'גובה': 'float64'})
cities.dtypes
The crimes dataset does not include data for cities that belong to regional councils. Therefore, we will combine the data of all the cities that belong to the same regional council.
regional = cities.groupby(['מעמד מונציפאלי'])
For categorical values we decided to take the mode (i.e. most frequent value) for each regional council.
temp1 = regional[['נפה', 'דת יישוב', 'צורת יישוב שוטפת', 'השתייכות ארגונית', 'מרחב משטרה']].agg(lambda x:x.value_counts().index[0])
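The mode-via-`value_counts` trick works because `value_counts` sorts by frequency, so `index[0]` is the most frequent value in each group. A toy example with hypothetical column names:

```python
import pandas as pd

toy = pd.DataFrame({'council': ['x', 'x', 'x', 'y'],
                    'religion': [1, 1, 2, 3]})
# most frequent religion code per council
modes = toy.groupby('council')['religion'].agg(lambda s: s.value_counts().index[0])
```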
For variables that represent the number of citizens, we take the sum.
temp2 = regional[['סך הכל אוכלוסייה 2018', 'ערבים']].sum()
For any other numerical values, we take the mean.
temp3 = regional[['שנת ייסוד', 'קואורדינטות', 'גובה']].mean()
We then combine this data to a regional dataframe.
regional = pd.concat([temp1, temp2, temp3], axis=1)
regional = regional[1:55]
regional.head()
To get the names and corresponding IDs of each regional council, we use the mapping file and match the numbers to the names from the crimes dataset.
# get regional council names from the crimes dataset
names = crimes.loc[crimes['סמל יישוב'] == -2]['שם יישוב'].unique()
# get mapping of 'מעמד מוניציפלי'
mapping = pd.read_excel('mapping.xlsx', sheet_name='מעמד מוניציפלי')
# get the numbers of the regional councils
mapping = mapping.loc[11:66].sort_values(by=' מעמד מוניציפלי')
nums = mapping['Unnamed: 1'].values.astype('float64')
# create df
df = pd.DataFrame([names, nums, nums], index=['שם יישוב', 'מעמד מונציפאלי', 'סמל יישוב']).T
df.head()
# merge names and numbers columns to regional dataset
regionals = pd.merge(regional, df, on='מעמד מונציפאלי')
regionals.head()
Next, we drop any cities that belong to regional councils from the cities dataset and add the new rows with the regional data.
cities = cities.loc[(cities['מעמד מונציפאלי'] == 99) | (cities['מעמד מונציפאלי'] == 0) | (cities['מעמד מונציפאלי'] == 'חסר מעמד')]
cities = pd.concat([cities, regionals], ignore_index=True)
cities.describe()
We see that we have some outliers in the column סך הכל אוכלוסייה 2018 (the maximum is significantly larger than the 75th percentile). Thus, we drop the rows that have more than 300,000 citizens.
cities = cities.loc[cities['סך הכל אוכלוסייה 2018'] < 300000]
In the crimes dataset, we update the IDs of the regional councils.
# get regional councils crimes data
crimes_regional = crimes_train_wide.loc[crimes_train_wide['סמל יישוב'] == -2].sort_values(by='שם יישוב')
# update the ID
crimes_regional['מספר יישוב'] = nums
crimes_regional.drop(['סמל יישוב'], axis=1, inplace=True)
crimes_regional = crimes_regional.rename(columns={'מספר יישוב': 'סמל יישוב'})
crimes_regional.head()
Now we drop the previous regional councils data and append the new data.
# drop regional councils data
crimes_train_wide = crimes_train_wide.loc[crimes_train_wide['סמל יישוב'] != -2]
# append new data
crimes_train_wide = pd.concat([crimes_train_wide, crimes_regional], ignore_index=True)
We create dummy variables for all categorical columns.
# create dummy variables
cols = {'נפה', 'מעמד מונציפאלי', 'דת יישוב', 'צורת יישוב שוטפת', 'השתייכות ארגונית', 'מרחב משטרה'}
dummy_data = pd.get_dummies(cities, columns=cols, drop_first=True)
dummy_data.head()
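A toy example of `drop_first=True` (with a hypothetical column), which drops one level per categorical column so the dummies are not perfectly collinear:

```python
import pandas as pd

toy = pd.DataFrame({'district': ['north', 'south', 'center']})
# one of the three levels ('center', alphabetically first) is dropped
dummies = pd.get_dummies(toy, columns=['district'], drop_first=True)
```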
In order to merge the datasets, we make sure both IDs are the same type.
# change ID type to float
dummy_data['סמל יישוב'] = dummy_data['סמל יישוב'].astype('float64')
In addition, we drop the city name variable because it is not exactly the same in both datasets.
# drop city name
dummy_data.drop('שם יישוב', axis=1, inplace=True)
crimes_train_wide.drop('שם יישוב', axis=1, inplace=True)
crimes_test_wide.drop('שם יישוב', axis=1, inplace=True)
Lastly, we merge both datasets using the ID סמל יישוב.
train_df = pd.merge(dummy_data, crimes_train_wide, on='סמל יישוב')
train_df.head()
train_df.describe()
# merge also test dataset
test_df = pd.merge(dummy_data, crimes_test_wide, on='סמל יישוב')
We make another merged dataset that does not include dummy variables which will be used in some specific cases later.
cities['סמל יישוב'] = cities['סמל יישוב'].astype('float64')
cities.drop('שם יישוב', axis=1, inplace=True)
df_clustering = pd.merge(cities, crimes_train_wide, on='סמל יישוב')
We standardize the columns for better performance of the algorithms.
train_features = train_df.loc[:, 'סך הכל אוכלוסייה 2018': 'שאר עבירות']
ID = train_df.loc[:, 'סמל יישוב']
# scaling train dataset
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(train_features), columns = train_features.columns)
train_df_scaled = pd.concat([ID, scaled], axis=1)
train_df_scaled.head()
# scale test df
test_features = test_df.loc[:, 'סך הכל אוכלוסייה 2018': 'שאר עבירות']
ID = test_df.loc[:, 'סמל יישוב']
# scaling test dataset
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(test_features), columns = test_features.columns)
test_df_scaled = pd.concat([ID, scaled], axis=1)
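Note that the code above fits a separate StandardScaler on the test set. A common alternative (a sketch, not the notebook's method) is to fit the scaler on the train set only and reuse its statistics for the test set, so both sets share the same transformation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(X_train)    # statistics from the train set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test reuses the train mean/std
```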
The following heatmap shows the correlations between features. Some types of crimes appear to be highly correlated.
temp = df_clustering.drop(['סמל יישוב'], axis=1)
temp_val=temp.corr().columns.values
temp_val[0] = 'סך הכל אוכלוסייה 8102'  # pre-reverse the year so '2018' still reads correctly after the RTL string reversal below
plt.figure(figsize=(15, 12))
sns.heatmap(temp.corr(), cmap='Blues', xticklabels=[i[::-1] for i in temp_val],
yticklabels=[i[::-1] for i in temp_val])
plt.title('Heatmap of Cities Dataset')
plt.show()
The following plot shows the distribution of the סך הכל אוכלוסייה 2018 feature.
plt.figure(figsize=(10, 8))
sns.histplot(df_clustering, x='סך הכל אוכלוסייה 2018', color='#b3cde3')
plt.grid(axis='y', alpha=0.5, color='lightgrey')
plt.title('Histogram of Population')
plt.xlabel('Population')
plt.show()
The following plot shows the overall number of crimes for the 10 cities with the most crimes.
top10_df = crimes.groupby('שם יישוב').sum()[[2019]].sort_values(by=[2019], ascending=False).head(11)
top10_df = top10_df.drop('אחר')
plt.figure(figsize=(10, 8))
plt.bar([i[::-1] for i in top10_df.index], top10_df[2019], color='#b3cde3', edgecolor='black')
plt.title('Cities with Most Overall Crimes in 2019')
plt.xlabel('City')
plt.ylabel('Overall Number of Crimes')
plt.grid(axis='y', alpha=0.5, color='lightgrey')
plt.show()
In this section, we attempt to identify clusters in the data, using K-Means and GMM.
We define two functions to avoid code repetition:
The fit_model function receives as input the data, the clustering model ('kmeans' or 'gmm') and the desired number of clusters.
The function fits the model to the data and returns the data with an added column of clusters.
def fit_model(X, mod, k):
if mod == 'kmeans':
model = KMeans(n_clusters=k, random_state=RSEED)
elif mod == 'gmm':
model = GaussianMixture(n_components=k, random_state=RSEED)
else:
raise ValueError('Model not found')
model.fit(X)
clusters = model.predict(X)
X["Cluster"] = clusters
return X
The perform_pca function receives as input the data and performs PCA with 2 components. It returns the dataframe with two added columns holding the first and second principal components. This is used to visualize the clusters.
def perform_pca(X):
pca_2d = PCA(n_components=2, random_state=RSEED)
PCs_2d = pd.DataFrame(pca_2d.fit_transform(X.drop(["Cluster"], axis=1)))
PCs_2d.columns = ["PC1_2d", "PC2_2d"]
X = pd.concat([X, PCs_2d], axis=1, join='inner')
return X
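As a sanity check for the 2-D visualization, PCA exposes `explained_variance_ratio_`, which shows how much of the variance the two components retain. A sketch on synthetic correlated data (not the crimes dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 100 samples, 5 nearly identical (highly correlated) features
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(100, 1)) for _ in range(5)])

pca = PCA(n_components=2, random_state=0)
pca.fit(X)
# with highly correlated features, PC1 captures almost all the variance
ratio = pca.explained_variance_ratio_
```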
The first clustering algorithm that we implement is K-Means. We use $K=3,4,5$.
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'kmeans', 3)
X = perform_pca(df)
kmeans_df3 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
data = [trace0, trace1, trace2]
title = "Visualizing K-Means with K=3 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'kmeans', 4)
X = perform_pca(df)
kmeans_df4 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
cluster3 = X[X["Cluster"] == 3]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
trace3 = go.Scatter(x = cluster3["PC1_2d"], y = cluster3["PC2_2d"],
mode = "markers", name = "Cluster 3",
marker = dict(color = '#decbe4'), text = None)
data = [trace0, trace1, trace2, trace3]
title = "Visualizing K-Means with K=4 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'kmeans', 5)
X = perform_pca(df)
kmeans_df5 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
cluster3 = X[X["Cluster"] == 3]
cluster4 = X[X["Cluster"] == 4]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
trace3 = go.Scatter(x = cluster3["PC1_2d"], y = cluster3["PC2_2d"],
mode = "markers", name = "Cluster 3",
marker = dict(color = '#decbe4'), text = None)
trace4 = go.Scatter(x = cluster4["PC1_2d"], y = cluster4["PC2_2d"],
mode = "markers", name = "Cluster 4",
marker = dict(color = '#fed9a6'), text = None)
data = [trace0, trace1, trace2, trace3, trace4]
title = "Visualizing K-Means with K=5 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
The second clustering algorithm that we implement is GMM. Again, we use $K=3,4,5$.
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'gmm', 3)
X = perform_pca(df)
gmm_df3 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
data = [trace0, trace1, trace2]
title = "Visualizing GMM with K=3 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'gmm', 4)
X = perform_pca(df)
gmm_df4 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
cluster3 = X[X["Cluster"] == 3]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
trace3 = go.Scatter(x = cluster3["PC1_2d"], y = cluster3["PC2_2d"],
mode = "markers", name = "Cluster 3",
marker = dict(color = '#decbe4'), text = None)
data = [trace0, trace1, trace2, trace3]
title = "Visualizing GMM with K=4 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
df = fit_model(train_df.drop(['סמל יישוב'], axis=1), 'gmm', 5)
X = perform_pca(df)
gmm_df5 = X.copy()
cluster0 = X[X["Cluster"] == 0]
cluster1 = X[X["Cluster"] == 1]
cluster2 = X[X["Cluster"] == 2]
cluster3 = X[X["Cluster"] == 3]
cluster4 = X[X["Cluster"] == 4]
trace0 = go.Scatter(x = cluster0["PC1_2d"], y = cluster0["PC2_2d"],
mode = "markers", name = "Cluster 0",
marker = dict(color = '#fbb4ae'), text = None)
trace1 = go.Scatter(x = cluster1["PC1_2d"], y = cluster1["PC2_2d"],
mode = "markers", name = "Cluster 1",
marker = dict(color = '#b3cde3'), text = None)
trace2 = go.Scatter(x = cluster2["PC1_2d"], y = cluster2["PC2_2d"],
mode = "markers", name = "Cluster 2",
marker = dict(color = '#ccebc5'), text = None)
trace3 = go.Scatter(x = cluster3["PC1_2d"], y = cluster3["PC2_2d"],
mode = "markers", name = "Cluster 3",
marker = dict(color = '#decbe4'), text = None)
trace4 = go.Scatter(x = cluster4["PC1_2d"], y = cluster4["PC2_2d"],
mode = "markers", name = "Cluster 4",
marker = dict(color = '#fed9a6'), text = None)
data = [trace0, trace1, trace2, trace3, trace4]
title = "Visualizing GMM with K=5 Using PCA"
layout = dict(title = title,
xaxis = dict(title='PC1', ticklen=5, zeroline=False),
yaxis = dict(title='PC2', ticklen=5, zeroline=False),
plot_bgcolor = '#f0f0f0')
fig = dict(data = data, layout = layout)
iplot(fig)
Both algorithms had very similar results (i.e. similar clusters).
For each cluster, we calculate the percentage of Arabs in the overall population, and check whether it relates to the share of עבירות נגד אדם and עבירות בטחון among all crimes.
We define the function create_res_df that creates a dataframe with the required percentages per cluster.
def create_res_df(df):
dat = pd.concat([df_clustering, df['Cluster']], axis=1)
# group df
grouped = dat.groupby('Cluster')
# calculate arab percentage
percentage = grouped['ערבים'].sum() / grouped['סך הכל אוכלוסייה 2018'].sum()
# calculate sum of all crimes
cols = [col for col in df_clustering.columns if 'עבירות' in col]
offenses = df_clustering[cols]
crimes = grouped[offenses.columns].sum().sum(axis=1)
# calculate crimes percentage
crime1 = grouped['עבירות נגד אדם'].sum() / crimes
crime2 = grouped['עבירות בטחון'].sum() / crimes
# create res df
res = pd.concat([percentage, crime1, crime2], axis=1).sort_values(by=0)
res = res.rename(columns={0: 'אחוז הערבים מכלל האוכלוסייה', 1: 'אחוז עבירות נגד אדם מסך העבירות', 2: 'אחוז עבירות ביטחון מסך העבירות'})
return res
create_res_df(kmeans_df3)
create_res_df(kmeans_df4)
create_res_df(kmeans_df5)
create_res_df(gmm_df3)
create_res_df(gmm_df4)
create_res_df(gmm_df5)
We see that both algorithms produced almost identical results (i.e. clusters), and in all of them, the higher the percentage of Arabs in the overall population, the more crimes of type נגד אדם and ביטחון we observe.
In our opinion, the best result is obtained by the K-Means algorithm with $K=3$ because it gives the clearest separation. It is not necessary to use more clusters because they do not lead to any improvement.
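One way to back up a visual judgment of "clearest separation" is the silhouette score, where higher values indicate better-separated clusters. A sketch on synthetic data (not the crimes dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=100)

scores = {}
for k in [3, 4, 5]:
    labels = KMeans(n_clusters=k, random_state=100, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```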
In this section, we attempt to predict the second most frequent crime for certain cities, using the Random Forest algorithm.
First we create a dataframe that includes only columns with the word עבירות.
cols = [col for col in train_df_scaled.columns if 'עבירות' in col]
offenses = train_df_scaled[cols]
offenses.head()
Next, we select the required rows from the test dataset.
cities_sign = ([1139, 7400, 507, 2640, 43, 3780])
rf_test = test_df[test_df['סמל יישוב'].isin(cities_sign)]
rf_test
Random Forest Regressor
We loop over the crime types and, for each one, train on the remaining features to predict the number of that crime type in 2019.
parameters_grid = {
'max_depth': [50, 100, 150, 200],
'min_samples_leaf': [1, 2],
'min_samples_split': [2, 3],
'n_estimators': [50, 100, 150],
'max_features': [5, 10, 15, 20]
}
pred_res = []
MSE = []
# create a loop for each 'עבירות' that is in the dataframe
for i in range(len(offenses.columns)):
# exclude the ith column from the train df
train_cols = [col for col in train_df_scaled.columns if col not in [offenses.columns[i]]]
# train df without the ith column
temp_train = train_df_scaled[train_cols].drop(['סמל יישוב'], axis=1)
# exclude the ith column from the test df
test_cols = [col for col in test_df_scaled.columns if col not in [offenses.columns[i]]]
# test df without the ith column
temp_test = test_df_scaled[test_cols].drop(['סמל יישוב'], axis=1)
# define model
rf = RandomForestRegressor(random_state=RSEED)
# define grid search
grid_search = GridSearchCV(estimator=rf, param_grid=parameters_grid, cv=10, n_jobs=-1)
# fit the model
grid_search.fit(temp_train, train_df[offenses.columns[i]])
# get best model
best = grid_search.best_estimator_
# predict using best model
y_pred = best.predict(temp_test)
# add predictions for the required cities
pred_res.append(y_pred[rf_test.index])
    # add MSE of the prediction
MSE.append(mean_squared_error(test_df[offenses.columns[i]], y_pred))
The following table presents the predicted crime numbers for each test city.
rf_res = pd.DataFrame(pred_res).T
rf_res.columns = offenses.columns
rf_res.index = ['כרמיאל','נתניה','כפר יאסיף','ראש העין','מטולה','ביתר עילית']
rf_res
Now, we select the second most frequent crime according to the predictions and compare it to the actual data of 2019.
pred_second = rf_res.apply(lambda row: row.nlargest(2).values[-1], axis=1)
pred_ind = rf_res.apply(lambda row: row.nlargest(2).idxmin(), axis=1)
actual_second = rf_test[cols].apply(lambda row: row.nlargest(2).values[-1], axis=1)
actual_ind = rf_test[cols].apply(lambda row: row.nlargest(2).idxmin(), axis=1)
res_df = pd.DataFrame(pred_ind, columns=['עבירה חזויה'])
res_df['עבירה אמיתית'] = actual_ind.values
res_df['מספר העבירות החזוי'] = pred_second
res_df['מספר העבירות האמיתי'] = actual_second.values
res_df
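The `nlargest(2)` logic above can be illustrated on a toy row with hypothetical crime names: `values[-1]` is the second-largest count, and `idxmin()` returns its label (the smaller of the two kept values):

```python
import pandas as pd

row = pd.Series({'theft': 120.0, 'fraud': 80.0, 'assault': 45.0})
top2 = row.nlargest(2)          # theft (120) and fraud (80), in that order
second_value = top2.values[-1]  # the second-largest count
second_name = top2.idxmin()     # label of the smaller of the two
```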
We see that 5 out of the 6 predictions were accurate, which is a good result.
In addition, we calculated the mean MSE presented below.
rf_mse = pd.DataFrame(round(np.mean(MSE), 3), columns=['Mean MSE'], index=['Random Forest Regressor'])
rf_mse
In this section, we attempt to predict the overall number of crimes in 2019 for certain cities.
First we create a dataframe that includes only columns with the word עבירות (both for the scaled and non-scaled data).
cols = [col for col in train_df_scaled.columns if 'עבירות' in col]
offenses = train_df_scaled[cols]
cols = [col for col in train_df.columns if 'עבירות' in col]
offenses_og = train_df[cols]
Next, we select the required rows from the test dataset.
cities_sign = ([2600, 9000, 7500, 6800, 26])
ab_test = test_df[test_df['סמל יישוב'].isin(cities_sign)]
ab_test
In addition, we define train and test datasets without the crimes columns.
cols = [col for col in train_df_scaled.columns if 'עבירות' not in col]
ab_train_scaled = train_df_scaled[cols]
cols = [col for col in test_df_scaled.columns if 'עבירות' not in col]
ab_test_scaled = test_df_scaled[cols]
The features of the train data do not include the test cities and thus they are removed, and the label is the sum of the crimes.
ab_train_X = ab_train_scaled.loc[~ab_train_scaled.index.isin(ab_test.index)]
ab_train_y = offenses_og.loc[~offenses_og.index.isin(ab_test.index)].sum(axis=1)
Ada-Boost Regressor
parameters_grid = {
'n_estimators': [10, 20, 50, 100, 150, 200],
'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 1.5, 2]
}
ab = AdaBoostRegressor(random_state=RSEED)
grid_search = GridSearchCV(estimator=ab, param_grid=parameters_grid, cv=10, n_jobs=-1)
grid_search.fit(ab_train_X, ab_train_y)
best = grid_search.best_estimator_
y_pred = best.predict(ab_test_scaled.loc[ab_test_scaled.index.isin(ab_test.index)])
We now compare the predicted vs. the true overall sum of the crimes for each test city.
cols = [col for col in train_df.columns if 'עבירות' in col]
offenses_test = test_df[cols]
y_actual = offenses_test.loc[offenses_test.index.isin(ab_test.index)].sum(axis=1)
ab_res = pd.DataFrame(y_pred, columns=['מספר העבירות החזוי'])
ab_res.index=['אילת','באר שבע','סח\'נין','קרית אתא','ראש פינה']
ab_res['מספר העבירות האמיתי'] = offenses_test.sum(axis=1)[ab_test.index].values
ab_res
In addition, we calculated the mean MSE presented below. We see that the MSE value is very high and the predictions were not very good.
MSE = mean_squared_error(y_actual, y_pred)
ab_mse = pd.DataFrame(round(MSE, 3), columns=['Mean MSE'], index=['Ada-Boost Regressor'])
ab_mse
sum_crimes = crimes.groupby("שם יישוב").sum()
sum_crimes_Q5 = sum_crimes[sum_crimes.index.isin(ab_res.index.values)]
Graphs of the Results
# helper to plot the actual yearly totals plus the 2019 prediction for one city
def plot_city(name):
    ax = sum_crimes_Q5[sum_crimes_Q5.index.isin([name])][[2014, 2015, 2016, 2017, 2018, 2019]].T.plot()
    ax.legend([name[::-1]])
    predicted = ab_res[ab_res.index.isin([name])]['מספר העבירות החזוי']
    point = pd.DataFrame({'x': [2019], 'y': predicted})
    # dashed line from the actual 2019 value to the predicted one
    ax.vlines(2019, sum_crimes_Q5[sum_crimes_Q5.index.isin([name])][[2019]], predicted, ls='--')
    point.plot(x='x', y='y', ax=ax, kind='scatter', label='ערך חזוי'[::-1], color='red')
    ax.set_xlabel('Year')
    ax.set_ylabel('Number of Crimes')

for name in ['אילת', 'באר שבע', "סח'נין", 'קרית אתא', 'ראש פינה']:
    plot_city(name)
In this section we attempt to predict the required police forces for each city.
First we create a dataframe that includes only columns with the word עבירות.
cols = [col for col in train_df_scaled.columns if 'עבירות' in col]
offenses = train_df_scaled[cols]
Random Forest Regressor
We loop over the crime types and, using the training data, predict the number of crimes of each type in 2019.
parameters_grid = {
'max_depth': [50, 100, 150, 200],
'min_samples_leaf': [1, 2],
'min_samples_split': [2, 3],
'n_estimators': [50, 100, 150],
'max_features': [5, 10, 15, 20]
}
pred_res = []
MSE = []
# loop over each 'עבירות' column in the dataframe
for i in range(len(offenses.columns)):
    # exclude the ith column from the train df
    train_cols = [col for col in train_df_scaled.columns if col not in [offenses.columns[i]]]
    # train df without the ith column
    temp_train = train_df_scaled[train_cols].drop(['סמל יישוב'], axis=1)
    # exclude the ith column from the test df
    test_cols = [col for col in test_df_scaled.columns if col not in [offenses.columns[i]]]
    # test df without the ith column
    temp_test = test_df_scaled[test_cols].drop(['סמל יישוב'], axis=1)
    # define model
    rf = RandomForestRegressor(random_state=RSEED)
    # define grid search
    grid_search = GridSearchCV(estimator=rf, param_grid=parameters_grid, cv=10, n_jobs=-1)
    # fit the model
    grid_search.fit(temp_train, train_df[offenses.columns[i]])
    # get best model
    best = grid_search.best_estimator_
    # predict using best model
    y_pred = best.predict(temp_test)
    # add predictions
    pred_res.append(y_pred)
    # add MSE of the ith crime type
    MSE.append(mean_squared_error(test_df[offenses.columns[i]], y_pred))
Next, we define the average number of work hours required for each crime type.
hours = np.array([20, 5, 15, 10, 35, 0, 25, 40, 30, 1, 1, 2, 0])
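The `hours` array is aligned with `offenses.columns` purely by position, which silently breaks if the column order changes. A sketch of an explicit mapping instead (the offense names here are hypothetical placeholders, not the real Hebrew column names):

```python
import numpy as np

# hypothetical offense names standing in for offenses.columns
offense_cols = ['burglary', 'assault', 'fraud']
hours_per_crime = {'burglary': 20, 'assault': 5, 'fraud': 15}

# rebuild the hours array in column order; a missing key fails loudly
hours = np.array([hours_per_crime[c] for c in offense_cols])
print(hours)  # [20  5 15]
```

With a dict, reordering or renaming a column raises a `KeyError` instead of quietly pairing a crime with the wrong number of hours.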
Using the function get_overall_hours, we calculate the overall work hours required per city. We calculate this for our train, test, and predicted data.
def get_overall_hours(df):
    # calculate hours per crime
    i = 0
    for col in offenses.columns.values:
        df[col] = df[col] * hours[i]
        i += 1
    # get sum of hours needed
    df['שעות עבודה'] = df[offenses.columns].sum(axis=1)
    return df.drop(offenses.columns, axis=1)
train_hours = get_overall_hours(train_df)
test_hours = get_overall_hours(test_df)
# assemble the predictions into a dataframe (avoiding the deprecated DataFrame.append)
rf_res = pd.DataFrame(pred_res).T
rf_res.columns = offenses.columns
pred_hours = get_overall_hours(rf_res)
res_df = pd.DataFrame([train_hours['שעות עבודה'], test_hours['שעות עבודה'], pred_hours['שעות עבודה']],
index=['train', 'test', 'predicted']).T
res_df.head()
We divide the cities into 3 clusters in the following manner:
Let $$\text{ratio}_i = \frac{\text{train}_i}{\text{predicted}_i}$$
where $\text{train}_i$ is the average number of hours in the past for city $i$, and $\text{predicted}_i$ is the predicted number of hours in 2019.
If the ratio is smaller than 0.5 (i.e. we predict significantly more hours in 2019), the police should increase its forces in city $i$. If the ratio is greater than 1.5 (i.e. we predict significantly fewer hours in 2019), the police should decrease its forces in city $i$. Otherwise (i.e. we predict no significant change), the police should make no changes in city $i$.
In addition, we divide the data into the same clusters using the test data instead of the predicted data, in order to evaluate our predictions.
for index, row in res_df.iterrows():
    # predicted cluster
    if row['train'] / row['predicted'] < 0.5:
        res_df.loc[index, 'pred_cluster'] = 'Increase'
    elif row['train'] / row['predicted'] > 1.5:
        res_df.loc[index, 'pred_cluster'] = 'Decrease'
    else:
        res_df.loc[index, 'pred_cluster'] = 'Keep'
    # actual cluster
    if row['train'] / row['test'] < 0.5:
        res_df.loc[index, 'actual_cluster'] = 'Increase'
    elif row['train'] / row['test'] > 1.5:
        res_df.loc[index, 'actual_cluster'] = 'Decrease'
    else:
        res_df.loc[index, 'actual_cluster'] = 'Keep'
res_df.head()
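The row-wise loop above can be vectorized with `np.select`, which applies all three threshold rules at once. A sketch on a toy frame with the same column names as `res_df`:

```python
import numpy as np
import pandas as pd

# toy frame with the same columns as res_df above
res_df = pd.DataFrame({'train': [100.0, 300.0, 120.0],
                       'test': [250.0, 150.0, 110.0],
                       'predicted': [240.0, 160.0, 100.0]})

def classify(ratio):
    # same thresholds as the loop: < 0.5 -> Increase, > 1.5 -> Decrease
    return np.select([ratio < 0.5, ratio > 1.5],
                     ['Increase', 'Decrease'], default='Keep')

res_df['pred_cluster'] = classify(res_df['train'] / res_df['predicted'])
res_df['actual_cluster'] = classify(res_df['train'] / res_df['test'])
print(res_df)
```

This produces the same clusters as the `iterrows` version without per-row Python overhead.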
# re-import pyplot, since the name plt was reassigned to an Axes in the plots above
import matplotlib.pyplot as plt
The following plot shows the predicted number of work hours, compared to the true values.
plt.figure(figsize=(8, 6))
ax = plt.gca()
ax.scatter(res_df['test'], res_df['predicted'], color='#b3cde3')
ax.set_yscale('log')
ax.set_xscale('log')
plt.title('Predicted vs. True Number of Hours')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.grid(alpha=0.5, color='lightgrey')
plt.show()
The following confusion matrix presents the predicted classification of the cities, compared to the true classification.
matrix = confusion_matrix(res_df['actual_cluster'], res_df['pred_cluster'],
                          labels=['Increase', 'Decrease', 'Keep'])
plt.figure(figsize=(8, 6))
sns.heatmap(matrix, annot=True, cmap='Blues',
            xticklabels=['Increase', 'Decrease', 'Keep'], yticklabels=['Increase', 'Decrease', 'Keep'])
plt.xlabel("Predicted", fontsize=12)
plt.ylabel("True", fontsize=12)
plt.title("Confusion Matrix", fontsize=14)
plt.show()
acc = pd.DataFrame([round(accuracy_score(res_df['actual_cluster'], res_df['pred_cluster']), 3)],
index=['Classification'], columns=['Accuracy'])
acc
We see that most cities were correctly classified (accuracy score 0.894). Note, however, that the data is quite imbalanced, with many more observations in the 'Keep' class.
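Because 'Keep' dominates, plain accuracy can look good even when the minority classes are missed entirely; balanced accuracy, which averages recall per class, exposes this. A sketch with toy labels (the real labels live in `res_df['actual_cluster']` and `res_df['pred_cluster']`):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# toy labels: 8 'Keep', one of each minority class; predictions miss both minorities
actual = ['Keep'] * 8 + ['Increase', 'Decrease']
pred = ['Keep'] * 10

plain = accuracy_score(actual, pred)               # 0.8 — looks fine
balanced = balanced_accuracy_score(actual, pred)   # 1/3 — exposes the imbalance
print(plain, balanced)
```

Reporting both metrics gives a fairer picture of the classifier on imbalanced clusters.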
In this section, we run the Random Forest algorithm from section 4 again. This time, we add the clusters from section 3 (K-Means with $K=3$).
First, we add the clusters to the train and test datasets.
train_df_scaled['Cluster'] = kmeans_df3['Cluster']
test_df_scaled['Cluster'] = kmeans_df3['Cluster']
Then, we create a dataframe that includes only the columns containing the word עבירות.
cols = [col for col in train_df_scaled.columns if 'עבירות' in col]
offenses = train_df_scaled[cols]
offenses.head()
Next, we select the required rows from the test dataset.
cities_sign = [1139, 7400, 507, 2640, 43, 3780]
rf_test = test_df[test_df['סמל יישוב'].isin(cities_sign)]
rf_test
Random Forest Regressor
We loop over the crime types and, using the training data, predict the number of crimes of each type in 2019.
parameters_grid = {
'max_depth': [50, 100, 150, 200],
'min_samples_leaf': [1, 2],
'min_samples_split': [2, 3],
'n_estimators': [50, 100, 150],
'max_features': [5, 10, 15, 20]
}
pred_res = []
MSE = []
# loop over each 'עבירות' column in the dataframe
for i in range(len(offenses.columns)):
    # exclude the ith column from the train df
    train_cols = [col for col in train_df_scaled.columns if col not in [offenses.columns[i]]]
    # train df without the ith column
    temp_train = train_df_scaled[train_cols].drop(['סמל יישוב'], axis=1)
    # exclude the ith column from the test df
    test_cols = [col for col in test_df_scaled.columns if col not in [offenses.columns[i]]]
    # test df without the ith column
    temp_test = test_df_scaled[test_cols].drop(['סמל יישוב'], axis=1)
    # define model
    rf = RandomForestRegressor(random_state=RSEED)
    # define grid search
    grid_search = GridSearchCV(estimator=rf, param_grid=parameters_grid, cv=10, n_jobs=-1)
    # fit the model
    grid_search.fit(temp_train, train_df[offenses.columns[i]])
    # get best model
    best = grid_search.best_estimator_
    # predict using best model
    y_pred = best.predict(temp_test)
    # add predictions for the required cities
    pred_res.append(y_pred[rf_test.index])
    # add MSE of the ith crime type
    MSE.append(mean_squared_error(test_df[offenses.columns[i]], y_pred))
The following table presents the predicted crime numbers for each test city.
# assemble the predictions into a dataframe (avoiding the deprecated DataFrame.append)
rf_res = pd.DataFrame(pred_res).T
rf_res.columns = offenses.columns
rf_res.index = ['כרמיאל','נתניה','כפר יאסיף','ראש העין','מטולה','ביתר עילית']
rf_res
Now, we select the second most frequent crime according to the predictions and compare it to the actual data of 2019.
# nlargest(2) is sorted descending, so its last entry is the second most frequent crime
pred_second = rf_res.apply(lambda row: row.nlargest(2).iloc[-1], axis=1)
pred_ind = rf_res.apply(lambda row: row.nlargest(2).index[-1], axis=1)
actual_second = rf_test[cols].apply(lambda row: row.nlargest(2).iloc[-1], axis=1)
actual_ind = rf_test[cols].apply(lambda row: row.nlargest(2).index[-1], axis=1)
res_df = pd.DataFrame(pred_ind, columns=['עבירה חזויה'])
res_df['עבירה אמיתית'] = actual_ind.values
res_df['מספר העבירות החזוי'] = pred_second
res_df['מספר העבירות האמיתי'] = actual_second.values
res_df
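The runner-up selection above relies on `Series.nlargest`, which returns the top values sorted in descending order, so the last entry of `nlargest(2)` is the second most frequent crime. On a toy row (crime names hypothetical):

```python
import pandas as pd

# one city's crime counts, with made-up crime names
row = pd.Series({'theft': 120, 'assault': 85, 'fraud': 40})

top2 = row.nlargest(2)        # sorted descending: theft (120), assault (85)
second_name = top2.index[-1]  # label of the second largest value
second_value = top2.iloc[-1]  # the second largest value itself
print(second_name, second_value)  # assault 85
```

Note that with tied values the choice among equal counts is arbitrary, so ties would need explicit handling.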
We see that 5 out of the 6 predictions were accurate, which is a good result.
In addition, we calculated the mean MSE, presented below.
rf_mse = pd.DataFrame([round(np.mean(MSE), 3)], columns=['Mean MSE'], index=['Random Forest Regressor'])
rf_mse
The mean MSE without the clusters was much smaller, which is better; adding the clusters did not improve the predictions.